-
Notifications
You must be signed in to change notification settings - Fork 41
feat: Implement Training Options pattern for flexible TrainJob customization #91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Implement Training Options pattern for flexible TrainJob customization #91
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
67d12d8 to
b39b364
Compare
2d7a8e6 to
dbba135
Compare
36c0160 to
95155f6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @abhijeet-dhumal
I believe we need to handle how options will be applied for while backends. either we ignore options for localprocess or make options targeted towards specific backend.
f092845 to
03994e7
Compare
Pull Request Test Coverage Report for Build 18750401322Details
💛 - Coveralls |
6908d56 to
9f4098c
Compare
0687968 to
3b7f688
Compare
| metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta( | ||
| name=train_job_name, labels=labels, annotations=annotations | ||
| ), | ||
| spec=models.TrainerV1alpha1TrainJobSpec( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @abhijeet-dhumal,
I was looking for workarounds to overcome some of the Kubeflow SDK limitations (like specifying PVCs for particular training jobs, etc..) and came across your PR, which seems to fix those limitations. I cloned your fork and found some issues with this specific part.
It looks like KubernetesBackend.train() never forwards the spec_section.get("podSpecOverrides") into the TrainerV1alpha1TrainJob we send to the API. As a result, any per-job pod settings—PVC mounts, tolerations, node selectors—are silently dropped. The podSpecOverrides argument is not used in models.TrainerV1alpha1TrainJobSpec at all. Let me know what you think.
Thanks for your amazing work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abhijeet-dhumal could it be due to the new field being podTemplateOverrides now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bohdandenysenko Thanks for catching this bug! You're absolutely right—the podSpecOverrides were being extracted from options but it wasn't used in TrainJob spec.
I've just pushed some fixes, Perhaps those might help!
@astefanutti Good catch — I see in Trainer the warnings ment for PodSpecOverrides deprecation, let me check.. If the upstream API has changed to podTemplateOverrides, I should update both the option class and the forwarding logic to match with the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bohdandenysenko All tests are passing now. Please check and let me know if you see any issues with the fix!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abhijeet-dhumal Thanks for this amazing work!
This looks good to me overall. Could you please change to use PodTemplateOverride and address the few open comments.
/assign @kubeflow/kubeflow-sdk-team @briangallagher
| metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta( | ||
| name=train_job_name, labels=labels, annotations=annotations | ||
| ), | ||
| spec=models.TrainerV1alpha1TrainJobSpec( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abhijeet-dhumal could it be due to the new field being podTemplateOverrides now?
kubeflow/trainer/__init__.py
Outdated
|
|
||
| # Import common training options (defaults to Kubernetes backend) | ||
| from kubeflow.trainer.backends.kubernetes.options import ( | ||
| PodSpecOverride, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would that be useful to have a dedicated kubeflow.trainer.options package?
That may be a way to consider getting rid of the With prefix, @kubeflow/kubeflow-sdk-team WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you remove the With prefix @astefanutti ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich removing With from the type names below, like PodSpecOverrides instead of WithPodSpecOverrides.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, +1 on both the dedicated options package and removing the With prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @astefanutti @andreyvelich
I have added Options as a dedicated package now,
Where imports will look like this for the SDK user flow :
# Style 1: only works for Kubernetes options + common options, not LocalProcess options.
from kubeflow.trainer import (
Labels, Annotations, PodTemplateOverrides,
PodTemplateOverride, PodTemplateSpecOverride, ContainerOverride
)
# Style 2: the SDK validates options at runtime for respective backend types and rejects incompatible ones.
from kubeflow.trainer.options import (
Labels, Annotations, PodTemplateOverrides,
PodTemplateOverride, PodTemplateSpecOverride, ContainerOverride,
ProcessTimeout, WorkingDirectory # LocalProcess options also available
)
# Style 3: Backend-specific imports
from kubeflow.trainer.options.kubernetes import (
Labels, Annotations, PodTemplateOverrides,
TrainerImage, TrainerCommand, TrainerArgs
)
from kubeflow.trainer.options.common import (
PodTemplateOverride, PodTemplateSpecOverride, ContainerOverride
)
from kubeflow.trainer.options.localprocess import (
ProcessTimeout, WorkingDirectory
)
Is this seems correct approach here ?
30b5efc to
1a08bdd
Compare
…podSpecOverride Signed-off-by: Abhijeet Dhumal <abdhumal@redhat.com>
…alProcessCompatible) Signed-off-by: Abhijeet Dhumal <abdhumal@redhat.com>
…nd Option protocol Signed-off-by: Abhijeet Dhumal <abdhumal@redhat.com>
…ain() Signed-off-by: Abhijeet Dhumal <abdhumal@redhat.com>
…tainer overrides Signed-off-by: Abhijeet Dhumal <abdhumal@redhat.com>
8ea01ac to
a303e7d
Compare
Signed-off-by: Abhijeet Dhumal <abdhumal@redhat.com>
a303e7d to
fd89e68
Compare
…bSet/Jobs labels Signed-off-by: Abhijeet Dhumal <abdhumal@redhat.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments, otherwise looks good to me.
| trainer: Optional configuration for a CustomTrainer or BuiltinTrainer. If not specified, | ||
| the TrainJob will use the runtime's default values. | ||
| options: Optional list of configuration options to apply to the TrainJob. Use | ||
| WithLabels and WithAnnotations for basic metadata configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| WithLabels and WithAnnotations for basic metadata configuration. | |
| Labels and Annotations for basic metadata configuration. |
| def __call__( | ||
| self, | ||
| job_spec: dict, | ||
| trainer: Optional[Union["BuiltinTrainer", "CustomTrainer"]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cannot that be the actual types rather than strings?
|
|
||
|
|
||
| @dataclass | ||
| class Name(KubernetesCompatible): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't that be in common?
What this PR does / why we need it:
Fixes #87, #92, #116
Sample import styles for Options :
Job Metadata
Container runtime customisation :
Pod level overrides:
Local process backend options:
Complete example :
Checklist: